ABSTRACT


The purpose of this thesis is to analyze the operation
of a distributed database management system under network
partitions, to review a number of existing methods proposed
to deal with this problem, and to present an alternate
approach that allows multiple partitions to keep operating
when the network is partitioned.

When a network that supports a distributed database with
redundant data becomes partitioned, each partition may
function separately. As a result, independent updates in
each partition may cause inconsistencies to arise. At network
reconnection time such divergent data, in particular copies
of the same data in different partitions, have to be
reconciled. There is no known general method for doing so.
Existing solutions are often unacceptable because system
availability is reduced. Two recently proposed methods that
allow continuous operation of multiple partitions may work
for certain applications but are not general enough.
I. INTRODUCTION

In the past decade there has been considerable work done
on multiprocessor systems and computer networks. As a
consequence of this work the concept of distributed
computing systems was developed and is presently a focus of
intensive research in academia and industry.

In particular, Distributed Data Base Systems (DDBS) have
become one of the more important research topics, since many
distributed systems are now being developed to provide users
with convenient access to data via some kind of
communications network.

A distributed database system has the potential
advantages of greater data availability and reliability,
since data-items may be replicated and accessed at several
sites throughout the system. We use the term "potential"
because availability should increase with the number of
copies of the data. If the multiple copies of data are
read-only then availability will, in fact, be increased.
However, when updates are also allowed, multiple copies may
provide no improvement if mutual consistency among copies of
the data is emphasized.

Mutual consistency requires that if all update activity
were to cease, then after some period of time all copies of
the same data will converge to the same value. There have
been many algorithms published for maintaining mutual
consistency during normal operation of a distributed
database. Unfortunately, these algorithms do not consider
mutual consistency in the face of network partitioning.

A network partition occurs when two or more disjoint
subsets of sites in the network cannot exchange messages
through the network (i.e. cannot communicate with each
other) even though some or all of them are up and
operational. A special case of network partitioning occurs
if the only path between two or more sites is the
communications network. In this case a single site crash
cannot be distinguished from a network partition that
separates that site from the rest of the network.

Network partitioning can completely destroy mutual
consistency in the worst case, and so the usual solution to
this problem has been to restrict operation during a
network partition in such a way that only one group of sites
(i.e. the sites within one partition) is allowed to perform
updates. The basic idea behind this approach is that no
update scheme is effective against partitioning in
guaranteeing mutual consistency of data unless data is
always kept accessible in only one partition [19], [18].
The methods proposed vary in the way in which they select
the set of sites allowed to perform updates.





However, schemes of this kind have a major drawback: it
may be unacceptable for the non-selected sites to shut down
operations while the network is partitioned. We must note
that it is worthwhile to have all partitions in operation if
(1) availability is just as important as consistency and
(2) "conflicts" among copies of data can always be
successfully reconciled (either automatically by the system,
or by a user) when communications are reestablished and the
network returns to normal operation [16].

It is necessary to realize that network partitions are
not due exclusively to communications failures or site
crashes. Networks can be interrupted for tactical reasons
(as when a warship decrees radio silence to avoid enemy
detection of radio waves) or simply for economic reasons
(a corporation batches messages to be transmitted over
different periods of time to attain lower communications
costs).

The goal of this thesis is to analyze and evaluate some
of the proposed methods for dealing with the network
partitioning problem and to give some useful ideas towards
the solution of this problem, especially when availability
is a prime consideration in the design of the system.

Chapter 2 presents some basic concepts that will be
useful for a better understanding of the following
discussion and also presents some problems and issues that
arise when the network partitions. Chapter 3 presents a
survey of methods proposed to deal with network partitions,
placing special emphasis on two of them that allow non-stop
operation of partitions. In chapter 4 we present an
alternate approach to continuous operation of partitions
based on precedence graphs. We also present the algorithm
required to detect conflicts and reconcile the database at
network reconnection time.
A. INTRODUCTION


The best way to describe the problem presented by
network partitions is by giving an example. Suppose we have
a network composed of three nodes A, B and C, and nodes A
and C each have a copy of data-object X, which may contain,
for example, a record of the savings account of a certain
person in a bank. Suppose that communications are
interrupted in such a way that site A can communicate with
site B but neither of them can communicate with site C,
thus dividing the network into two partitions P1 (formed by
nodes A and B) and P2 (formed by node C). In this case both
partitions P1 and P2 have access to data-object X, but if we
allow both partitions to update data-object X independently,
they may perform inconsistent updates to it. This happens
because update messages cannot be sent through the
interrupted communications line.

Now, taking the worst case, assume the savings account of
the person mentioned above holds 10,000 dollars and the
person is not very honest. When he learns about the
partition he goes to node A and withdraws all the money in
his savings account, and immediately goes to node C and does
the same operation.

Thus he will have 20,000 dollars in his hands and the
bank will have in each partition data-object X with the same
value of 0 dollars in the savings account record. When
communication is reestablished between the three nodes and
reconciliation is done, we will have a negative savings
account record for data-object X and the problem of having
to recover that money.
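The example above can be traced with a small sketch. The
dictionaries p1 and p2 and the function withdraw are
hypothetical names introduced only for illustration, and the
reconciliation rule used here (apply every debit from both
partitions to the pre-partition balance) is one possible
assumption, not a method prescribed by this thesis.

```python
# Hypothetical model: each partition holds its own physical copy of X.
def withdraw(copy, amount):
    """Withdraw from one copy; only overdrafts visible locally are refused."""
    if amount > copy["balance"]:
        raise ValueError("insufficient funds")
    copy["balance"] -= amount
    copy["debits"].append(amount)

initial = 10_000
p1 = {"balance": initial, "debits": []}   # copy at node A (partition P1)
p2 = {"balance": initial, "debits": []}   # copy at node C (partition P2)

withdraw(p1, 10_000)   # accepted: P1 cannot see what happens in P2
withdraw(p2, 10_000)   # accepted: P2 cannot see what happens in P1

# Assumed reconciliation rule: apply all debits to the old balance.
reconciled = initial - sum(p1["debits"]) - sum(p2["debits"])
print(p1["balance"], p2["balance"], reconciled)
```

Each copy passes its local check, yet the reconciled balance
falls below zero, which is exactly the situation described in
the text.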


B. BASIC DEFINITIONS AND CONCEPTS


In this section we review the basic concepts that will be
needed in the rest of the thesis. A more complete
discussion of these concepts can be found in [7] and [10].

A distributed database system is a collection of named
data-objects. Each object has a name and m values
associated with it, where m <= n and n is the number of
sites in the system. The sites are interconnected by a
network and each site runs two software modules: a
transaction manager (TM), which supervises the execution of
transactions, and a data manager (DM), which processes read
and write operations on the data stored at the site.

A logical database is a set of logical data-objects. A
copy of a logical data-object stored at a site is called a
physical data-object. Logical data-objects will be denoted
by uppercase letters, e.g. X, and physical data-objects will
be denoted by lowercase letters, e.g. x1,...,xm. The set of
all physical data-objects stored at a site is called the
database of that site.

Operations on data are grouped into transactions. A
transaction is a program that accesses the database by
issuing read and write operations on logical data-objects.
In the read case its TM selects one copy of the data-object
and issues a read operation to the DM that manages that
object. In the write case the TM issues a write operation
for every physical copy of the logical data-object.
Transactions are the units of consistency and recovery.
They can be viewed as larger atomic actions on the system
state which transform it from one consistent state to a new
consistent state. Transactions preserve database
consistency because if some atomic action of a transaction
(e.g. a read) fails then the entire transaction is undone,
returning the database to a consistent state.
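The mapping from logical operations to physical copies can be
sketched as follows. The class names DataManager and
TransactionManager and their methods are hypothetical and
chosen only to mirror the TM and DM roles described above;
failure handling and undo are deliberately left out.

```python
import random

class DataManager:
    """DM for one site: holds the physical copies stored at that site."""
    def __init__(self):
        self.store = {}

    def read(self, name):
        return self.store[name]

    def write(self, name, value):
        self.store[name] = value

class TransactionManager:
    """TM that maps logical reads and writes onto physical copies."""
    def __init__(self, copies):
        # copies: logical object name -> list of DMs holding a copy
        self.copies = copies

    def read(self, name):
        dm = random.choice(self.copies[name])   # read any one copy
        return dm.read(name)

    def write(self, name, value):
        for dm in self.copies[name]:            # write every copy
            dm.write(name, value)

site_a, site_c = DataManager(), DataManager()
tm = TransactionManager({"X": [site_a, site_c]})
tm.write("X", 5000)
```

After the write both sites hold the same value, so a read
returns 5000 whichever copy the TM happens to select.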

A transaction is made atomic by use of a commit
protocol. A commit is an unconditional guarantee to execute
the transaction to completion, even in the event of
failures. An abort is an unconditional guarantee to back out
the transaction. The problem of guaranteeing transaction
atomicity in a distributed system is that of ensuring that
all the sites either unanimously abort or unanimously
commit. After the commit the new value is made available to
all other transactions.
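The unanimity requirement can be illustrated with a minimal
sketch in the style of a voting commit protocol. The text does
not name a specific protocol, so the functions decide and
run_commit and the representation of pending updates are
assumptions made for this illustration; message loss and
coordinator failure are not modelled.

```python
def decide(votes):
    """Return the single outcome that every site must apply.

    votes: site name -> True (ready to commit) or False (must abort).
    """
    return "commit" if votes and all(votes.values()) else "abort"

def run_commit(sites, pending):
    """sites: site name -> dict acting as that site's database.
    pending: site name -> (key, new_value, can_commit)."""
    votes = {site: ok for site, (_, _, ok) in pending.items()}
    outcome = decide(votes)
    if outcome == "commit":
        for site, (key, value, _) in pending.items():
            sites[site][key] = value   # new value becomes visible
    return outcome
```

A single negative vote forces every site to abort, so no site
ever exposes a value that another site has backed out.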
Concurrency control is the activity of coordinating
transactions that access a database concurrently. The goal
is to prevent concurrent transactions from interfering with
each other, so that every transaction sees a consistent
database state. Inconsistencies may arise because
transactions, which are the user's atomic operations, have a
coarser granularity than actions on objects, which are the
atomic operations directly supported by the underlying
system. If several transactions execute concurrently, their
actions get interleaved in an arbitrary way, allowing data
inconsistencies to arise. Concurrency control mechanisms
typically use locks to regulate access to shared resources.
The lock is a serialization mechanism which ensures that
only one transaction at a time is using a specific object.
The lock notifies other transactions that the object is
currently being used and protects the requestor from other
transactions trying to modify the object.

A formal definition of database consistency is based on
the notion of a serializable schedule. A schedule is any
sequence of actions performed by a set of transactions on
database objects. A schedule is serializable if it is
equivalent to a serial schedule, that is, to a schedule in
which transactions execute serially, one after the other,
with no concurrency.

A schedule is consistent if and only if it is
serializable. Generally serializability is obtained by
requiring that each transaction in the schedule be two-phase
and well formed. A transaction is two-phase if it never
locks any data after releasing some lock. It is well formed
if it always locks in exclusive mode any data that it writes
and locks in shared mode any data that it reads. In order to
facilitate easy recovery it is required that all the locks
be released at the end of the transaction.
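The two conditions can be checked mechanically on the action
sequence of a single transaction. The sketch below is only an
illustration: the action names lock_s, lock_x, unlock, read and
write are hypothetical, and we assume that an exclusive lock
also covers reads of the same item.

```python
def is_two_phase(ops):
    """ops: list of (action, item) pairs for one transaction."""
    released = False
    for action, _ in ops:
        if action == "unlock":
            released = True
        elif action in ("lock_s", "lock_x") and released:
            return False          # a lock was acquired after a release
    return True

def is_well_formed(ops):
    held = {}                     # item -> "S" (shared) or "X" (exclusive)
    for action, item in ops:
        if action == "lock_s":
            held.setdefault(item, "S")
        elif action == "lock_x":
            held[item] = "X"
        elif action == "unlock":
            held.pop(item, None)
        elif action == "read" and item not in held:
            return False          # read without any lock
        elif action == "write" and held.get(item) != "X":
            return False          # write without an exclusive lock
    return True

good = [("lock_s", "a"), ("read", "a"), ("lock_x", "b"),
        ("write", "b"), ("unlock", "a"), ("unlock", "b")]
bad = [("lock_x", "a"), ("write", "a"), ("unlock", "a"),
       ("lock_s", "b"), ("read", "b"), ("unlock", "b")]
```

Here good satisfies both conditions, while bad is well formed
but not two-phase because it locks b after releasing a.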

A log (sometimes called audit trail or journal) is a
history of all the actions of transactions on recoverable
objects. Each action which modifies a recoverable object
writes a log record giving the old and new values of the
updated object. Read operations need generate no log
records, but update operations must record enough
information in the log so that, given the record at a later
time, the operation can be completely undone or redone.
These records will be aggregated by transaction and
collected in a common system log. The log is desirable
because we want to be able to commit or undo updates on a
per-transaction basis without affecting other transactions.
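A log of this kind can be sketched in a few lines. The class
Log and its methods are hypothetical; the sketch assumes that
the database is a dictionary, that None is never stored as a
value (so an old value of None means the key was absent), and
that no other transaction has overwritten the same object
before the undo takes place.

```python
class Log:
    def __init__(self):
        self.records = []                 # (txn, key, old, new)

    def write(self, db, txn, key, new):
        """Record old and new values, then apply the update."""
        self.records.append((txn, key, db.get(key), new))
        db[key] = new

    def undo(self, db, txn):
        """Restore old values of one transaction, newest update first."""
        for t, key, old, _ in reversed(self.records):
            if t == txn:
                if old is None:
                    db.pop(key, None)
                else:
                    db[key] = old

    def redo(self, db, txn):
        """Reapply the new values of one transaction in original order."""
        for t, key, _, new in self.records:
            if t == txn:
                db[key] = new

db = {"X": 5000}
log = Log()
log.write(db, "T1", "X", 3000)
log.write(db, "T2", "Y", 7)
log.undo(db, "T1")    # T1 is backed out, T2's update is untouched
```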


C. PROBLEMS AND ISSUES 


In this section we present some problems and issues that
should be considered when dealing with the problem of
network partitioning.

When a network partition occurs we have three basic
alternatives:

(1) Halt all transaction processing in the partitions
until the network is completely reconnected again.

(2) Allow one partition to process transactions that
update data-objects while the rest may accept
read-only transactions.1

(3) Allow all partitions to continue operating "in
parallel" during the partition and reconcile the
databases at partition merge.

We could consider two more alternatives: first, to delay all
transactions during the partition, and second, to execute
the transactions and then roll back the entire database,
re-executing all transactions after the partition ends.
These alternatives are not considered because we would be
better off simply using alternative (1). Clearly
alternative (1) is not reasonable, since one of the
advantages of a distributed system is its increased
availability. Halting transaction processing in all
partitions would be contrary to the idea of having
replicated data to make data accessible after failures.


1The user should receive a warning which alerts him to
the possibility that the values may be out of date.
Alternative (2) seems more practical and is in fact
usually taken as a reasonable compromise. Most of the
methods proposed to deal with network partitions allow only
one group of sites to process transactions.
Allowing only one partition in operation facilitates
recovery after the partition, since to reconcile the
databases it is only required that sites in the non-active
partitions perform all the updates they missed. The only
problem in this approach is to guarantee that at most one
group of sites processes transactions. In chapter 3 we
review some of the methods proposed in order to achieve this
objective. However, these approaches may be unacceptable to
those sites that must remain non-active during the partition
when availability is highly desired.

The third alternative, to allow all partitions to
process transactions, should be the goal of a system
where availability is one of the primary concerns.
However, there are some serious problems in allowing
"parallel" operation of partitions. As each partition
processes different transactions and stores different values
into the databases, the values of the data-objects stored at
sites in different partitions will diverge and database
reconciliation is required when the network is reconnected.

In order to make the databases consistent after a
partition we can use two strategies. The first strategy is
to undo transactions that made conflicting updates to
data-objects. For example, assuming two partitions, at
partition merge transactions in different partitions that
updated physical copies of the same data-object are detected
and some of them are undone. The value of the data-objects
in the partition where the transactions were undone is made
equal to the value written by the transactions that were
left. An important consideration is that each transaction
that read the values updated by a transaction that was
undone should also be undone. This requires a detailed log
and the necessary overhead to detect conflicting
transactions. Also, the users that executed transactions
during the partition will not know whether the values
produced by their transactions are valid until the
partition is corrected.

The second way of achieving mutual consistency after a
partition is to use semantic knowledge in order to
"integrate" the values of diverging data-objects [19]. This
is a very difficult problem and has been discussed in detail
by Faissol2 in [8]. For example, an object r in an airline
reservation system indicates the number of available seats
on a flight. If after the partition the values of object r
are v1 and v2, then the correct value of r is given by
v1 + v2 minus the value of r before the partition. Note that
if the reconciled value is negative then reservations will
have to be cancelled, with the consequent discomfort of some
affected customers. Obviously, special measures should be
taken in partitioned mode operation to avoid these problems.

2By the use of partitionable integrity assertions. This
is discussed in chapter 3.
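The seat-count rule can be checked with a short calculation.
The function reconcile_counter and the sample figures are
hypothetical; the rule itself is the one stated above, which
combines the change made in each partition relative to the
pre-partition value.

```python
def reconcile_counter(before, v1, v2):
    """Merge two diverged copies of a counter by combining both deltas."""
    return v1 + v2 - before

seats_before = 10
v1 = seats_before - 4      # partition 1 sold 4 seats, leaving 6
v2 = seats_before - 3      # partition 2 sold 3 seats, leaving 7
merged = reconcile_counter(seats_before, v1, v2)

# Overbooking case: 8 seats sold in one partition and 7 in the other.
shortfall = max(0, -reconcile_counter(10, 10 - 8, 10 - 7))
```

In the first case three seats remain. In the second case the
reconciled value is -5, so five reservations would have to be
cancelled.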

As we can see, we have to pay a high price in order
to assure increased availability of data during a network
partition. However, there are some circumstances that
lessen the overhead which would otherwise be incurred in
detecting and solving conflicts. For example, it has been
pointed out in [5,6] that in a large class of applications
most transactions require little or no synchronization at
all because they will never interfere with each other.


2. Correct Operation Under Network Partitions


In order to provide correct operation of a distributed
database under a network partition there are three
aspects that should be observed:

(1) Preservation of mutual consistency.
(2) Compliance with integrity constraints.
(3) Control of external actions.

Within each partition mutual consistency between
copies of data-objects at different sites is preserved using
concurrency control methods in the same way as they would be
in a connected network. Therefore, each copy of the
database in different partitions is internally consistent.
However, since there is no communication between partitions,
the transactions in each partition will run without
coordination between them and we may end up with different
values of the modified objects. That is, if the same
logical3 data-object is modified by transactions in
different partitions, then we will have a globally
inconsistent state. Note that even if the value of the same
data-object in two partitions is equal we cannot assume that
the correct value is the value stored at both partitions.
For example, if our bank account balance is 5000 dollars in
each partition and it is debited by 2000 dollars in each, we
will find at partition merge that both balances are 3000
dollars. However, this is not the correct value, since if
both transactions had been executed with a connected network
the final value of the account balance would be 1000
dollars.

Assuming we do not know anything about the semantics4
of updates applied to data-objects, we can solve the
inconsistency that arose in the example above at partition
merge by first detecting the conflicting transactions in
both partitions and second reconciling the two copies of
the data-object. Reconciliation will require that one of
the transactions be backed out, that the update of the
remaining transaction be forwarded to the other partition,
and finally that the backed-out transaction be executed in
both partitions. Figure 2.1 shows the process.

3See definition in section B.
4As is the case in the method we develop in chapter 4.
     Partition (P1)                  Partition (P2)

T1 reads balance = 5000         T2 reads balance = 5000
T1 writes balance = 3000        T2 writes balance = 3000
           |                               |
           +---------------+---------------+
                           |
                  Conflict is detected
                           |
                           v
          Back-out T2 from P2
          T1 writes balance = 3000 in P2
          T2 reads balance = 3000 in P1 and P2
          T2 writes balance = 1000 in P1 and P2


Figure 2.1 Restoring Mutual Consistency
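The steps of Figure 2.1 can be replayed with a small sketch.
The function merge is hypothetical and handles only this
single-object case with two debit transactions, T1 in P1 and
T2 in P2; the choice of T2 as the transaction to back out is
an arbitrary assumption.

```python
def merge(start, t1_amount, t2_amount):
    """Replay the reconciliation steps of Figure 2.1 for one balance."""
    p1 = start - t1_amount      # T1 executed in P1 during the partition
    p2 = start - t2_amount      # T2 executed in P2 during the partition

    # Conflict: the same logical object was written in both partitions.
    p2 = start                  # back out T2 from P2
    p2 = p1                     # forward T1's update to P2
    p1 = p1 - t2_amount         # re-execute T2 in P1
    p2 = p2 - t2_amount         # re-execute T2 in P2
    return p1, p2

balances = merge(5000, 2000, 2000)
```

Both copies end at 1000 dollars, the value that a connected
network would have produced.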


Of course, it is not always this simple, and we will
have to make several considerations to restore mutual
consistency in more complicated cases. However, the main
idea behind this example is that it is possible to restore
mutual consistency in a database that has been independently
modified in different partitions and so mutual consistency
can be preserved.

Integrity constraints have been classified in [2] as
operational constraints and semantic constraints.
Operational constraints are those related to the
preservation of database integrity against inconsistencies
that arise from the concurrent execution of several
transactions. As we have seen, the concurrency control
mechanism will assure that operational constraints are not
violated. Semantic constraints are those related to the
preservation of database integrity against inconsistencies
that arise from violations of what data is supposed to mean.
For example, in a record for a course containing the fields
Exam %, Homework % and Labs %, indicating the percentage of
the grade devoted to each of them, we would expect that the
sum of the values of the fields is 100.
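A semantic constraint of this kind is simply a predicate over
the record. In the sketch below the field names exam, homework
and labs and the function name are hypothetical stand-ins for
the fields of the course record.

```python
def satisfies_grade_constraint(record):
    """Semantic constraint: the three percentages must add up to 100."""
    return record["exam"] + record["homework"] + record["labs"] == 100

valid = {"exam": 50, "homework": 30, "labs": 20}
invalid = {"exam": 50, "homework": 30, "labs": 30}
```

The concurrency control mechanism cannot detect that the
second record is wrong, since the violation concerns what the
data means rather than how transactions were interleaved.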

Unless we use semantic knowledge to implement an
approach to continuous operation of partitions as in [8],
the requirements for compliance with operational and
semantic constraints are the same in each partition as the
ones in the completely connected network.

In addition to the database contents, external
actions may have been performed in response to a transaction
and some of these cannot be reversed. For example,
dispensing cash to a customer is in theory an irrecoverable
external action. Under a network partition the problem of
allowing external actions becomes more complex because of
the independent execution of transactions in different
partitions. External actions must be restricted when
operating in partitioned mode unless we can reverse the
external action by some kind of compensation. For example, a
message sent to a terminal must be followed by a validation
note. If the validation note is not received then the user
will know that the message received may not be valid and can
decide what action to take. If we cannot give a compensation
for an external action then it should not be allowed. For
example, we should not allow cash dispensing since we cannot
compensate for it. However, one of the partitions may be
allowed to execute external actions provided that this
action is not repeated in other partitions at partition
merge.
III. PREVIOUS WORK ON PARTITIONING


A. INTRODUCTION


In this chapter we will review previous methods proposed
to deal with network partitions. As we saw in chapter 2, the
most used alternative is to allow only one partition to
update the database. All other partitions stop the
updating activity on their databases in order to facilitate
the database reconciliation at partition merge.

Section B presents a very brief review of some of these
methods. We do not spend too much time analyzing them
because availability is significantly restricted and we are
more interested in continuous operation of the different
partitions under partitioning. Also, none of these
approaches openly states how conflicting versions of
data-objects are detected or what is to be done with them
upon partition merge.

We are especially interested in high availability of
data, so the methods presented in section C, which allow
non-stop operation of partitions, will be presented in more
detail. They are two recently proposed methods. The first one
uses the version vector mechanism in order to detect file
conflicts. This approach is more suitable for an operating
system environment. The second approach is based on semantic
knowledge about operations on the data stored in the
distributed database.


B. APPROACHES INVOLVING ONE OPERATING PARTITION


1. Voting


There are some proposed voting based systems in the
literature [15], [9]. There are two ways to implement
these systems. The first one is more suitable for fully
replicated databases. Here each site is assigned a weight
(a number of votes). When a partition occurs the sites in
the partition with a majority of votes are allowed to
process the transactions (i.e. update data-objects). Sites
in other partitions go down or are allowed to process
read-only transactions. The advantage of this approach is
that if a user has access to a site which is up he has access
to the entire database; however, users that have access only
to down sites are restricted to reading the data.

The second implementation [9] is more general in the
sense that it does not require a fully replicated database.
The users desiring to modify an object must lock it by
obtaining a majority in a vote. That is, updates are only
allowed if a majority of sites vote to allow the update.
Since there can be at most one partition containing a
majority of sites, any object will be updated in at most one
partition. However, it may happen that there is no
partition which contains a majority of sites, in which case
updates are not allowed in any partition.

Consistency is easy to preserve since at partition
merge minority sites receive the missed updates and
apply them to their copies in time stamp order. This is a
clear example where mutual consistency is guaranteed at the
expense of availability. A disadvantage of voting in
general is that it may be unacceptable for the minority
sites to be prevented from operating during a partition.
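The majority rule of the first implementation can be sketched
as follows. The function may_update and the weight table are
hypothetical; a strict majority of the total votes is assumed,
so that two disjoint partitions can never both qualify.

```python
def may_update(partition, weights):
    """True if the sites in `partition` hold a strict majority of votes."""
    total = sum(weights.values())
    held = sum(weights[site] for site in partition)
    return 2 * held > total

weights = {"A": 1, "B": 1, "C": 1}
p1, p2 = {"A", "B"}, {"C"}

# With four equally weighted sites split two and two, no partition wins.
even = {"A": 1, "B": 1, "C": 1, "D": 1}
```

In the three-site case only P1 may update. In the four-site
case neither half holds a majority, which is the situation in
which updates are not allowed in any partition.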


2. Tokens 


In this approach it is assumed that each data-object
has a token which can be passed from copy to copy. Only
sites in the partition containing the token are permitted to
modify the object. In other words, if the token for every
data-object accessed by a transaction resides at some site
in a given partition then the transaction may be executed in
that group, so using tokens might be less restrictive than
using voting.

This approach seems to be best suited for a file
system, where transactions access a single data-object. The
problem with transactions accessing more than one
data-object is that there may be transactions that cannot be
executed at any site since the necessary tokens are in
different partitions. Another disadvantage is that tokens
can be lost (e.g. in a hard crash), and the problem of
recreating tokens is nontrivial. Furthermore, there is a
danger of making a resource unavailable if, when the
partition occurs, the token is in a very rarely used part of
the network.
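The token rule reduces to a simple membership test. In this
sketch the function executable_in and the table token_site,
which records the site currently holding each object's token,
are hypothetical names used only for illustration.

```python
def executable_in(partition, txn_objects, token_site):
    """A transaction may run in a partition only if the partition
    holds the token of every data-object the transaction accesses."""
    return all(token_site[obj] in partition for obj in txn_objects)

token_site = {"X": "A", "Y": "C"}   # token for X is at A, for Y at C
p1, p2 = {"A", "B"}, {"C"}
```

A transaction on X alone can run in P1, but a transaction that
accesses both X and Y cannot run in either partition because
its tokens are split between them.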


3. Primary Sites


This method was originally proposed in [1]. In this
approach each data-object has a primary site, which is a
single site that is appointed responsible for an
object's activities. Transactions are executed in a
partition if it contains the primary sites of all the
objects in the read and write sets of the transaction.

This approach may provide better availability
(similar to the token approach) than the voting scheme, but
it also suffers from the same problems with respect to sites
in other partitions, that is, it may be unacceptable for
them to operate without updates.

Note that the idea of primary sites and tokens is
the same, but in this case the "token" cannot move around
and thus cannot be lost. However, the token approach offers
more flexibility because the "primary site" may vary
dynamically as required. Another disadvantage of the primary
sites approach is that, upon partitioning, if a primary site
was involved in a site crash and a backup site is elected as
the new primary site, then consistency problems can arise
since the information stored in the original primary site
(e.g. recent updates) is not available to the backup site.
4. Reliable Networks


This approach was adopted in SDD-1 [5]. In this
system all possible transactions are divided into classes
with variable synchronization levels. The classification is
made a priori and requires knowledge of the allowed
operations and their semantics. Conflicts between
transactions of the same class are avoided by a technique
called pipelining, based on the assumption that in most
applications the operations in a database are known a priori
and that most of them do not conflict. What pipelining does
is to allow only one transaction of each class to execute at
a time in a global time stamp order.

Communications in SDD-1 are based on the use of
a "reliable network" [12], which guarantees that messages
are going to be delivered eventually, even when a partition
occurs. Messages are saved in "spoolers" to be transmitted
following a break in communications. In the case of a
partitioned network, nonconflicting classes can clearly
operate, but the solution for conflicts within classes
cannot be implemented due to the lack of communication
among partitions, which prevents the exchange of
messages necessary to pipeline the transactions. Thus no
guarantee of post-partition consistency exists, because
nothing is done to prevent conflicts between transactions
when the partitions merge.
C. APPROACHES INVOLVING MULTIPLE OPERATING PARTITIONS


1. Version Vector Mechanism


This approach was first presented in [16] and was
used in the design of LOCUS, a local network operating
system at UCLA. It was intended for automatic detection of
mutual inconsistency between files upon recovery from
network failures and especially upon partition merge.
However, the results did not generalize to transactions that
accessed more than one file.

Parker and Ramos [17] extended the "version vector"
mechanism originally used to implement this approach so as
to detect inconsistency when more than one file is used by a
transaction. It is important to note that this approach is
intended primarily for an environment where file update
rates are moderate and conflicts occur only rarely. In this
subsection we give a detailed presentation of this approach.


a. Preliminary Definitions 


In this subsection we present some definitions
which are required in order to understand this approach. An
origin point OP(f) of a file f is a globally unique
identifier5 which is assigned to f when it is created.
Although f's name can change, the origin point remains an
immutable attribute. Note that two files based on a common
one can have the same origin point.

5For example the pair (time of creation, site of
creation).

A name conflict occurs when two or more files
from the various partitions have the same name but different
origin points. A version conflict occurs when two or more
files have the same origin point but different names and/or
different contents. A file conflict is detected after a
partition if either a name or a version conflict is
detected. To restore mutual consistency, file conflicts must
be reconciled so that file names again uniquely identify a
file.

A partition graph G(f) for a file f is a
directed acyclic graph (DAG) which is labelled as follows:
The source and sink nodes are labelled with the names of the
sites in the network that contain copies of file f. Each
node can only be labelled with site names appearing on its
ancestors and each name in a node appears on exactly one of
its descendants.

A version vector for a file f is a sequence of n
pairs where n is the number of sites that store f. The i-th
pair (Si:Vi) counts the number Vi of updates to f made at
site Si. A set of version vectors is compatible when one
vector is at least as large as any other vector in every
site component for which they have entries. A version vector
is an encoding of the partial order6 describing the set of
updates made at various sites. Independent updates leading
to incomparable versions in the partial order have
incomparable vectors as a result. A set of vectors conflict
when they are not compatible.

For example, suppose that file f is stored at
sites A and B. Initially the version vector associated with
f will be <A:0,B:0>; every time f is modified at a site the
version vector will change accordingly. If f is modified at
B then the new version vector will be <A:0,B:1>. The version
vectors <A:0,B:1> and <A:1,B:0> conflict because neither
vector dominates the other.
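
The compatibility test just described can be illustrated with a minimal Python sketch. The function names (dominates, compatible, conflict) and the use of dictionaries keyed by site name are assumptions made for illustration only; they are not taken from [17].

```python
def dominates(u, v):
    """True if vector u is at least as large as v in every
    site component for which both vectors have entries."""
    shared = u.keys() & v.keys()
    return all(u[site] >= v[site] for site in shared)


def compatible(vectors):
    """A set of vectors is compatible when one of them dominates all others."""
    return any(all(dominates(u, v) for v in vectors) for u in vectors)


def conflict(vectors):
    """A set of vectors conflict when they are not compatible."""
    return not compatible(vectors)


initial = {"A": 0, "B": 0}   # <A:0,B:0>
after_b = {"A": 0, "B": 1}   # f modified once at site B
after_a = {"A": 1, "B": 0}   # f modified once at site A

print(compatible([initial, after_b]))  # after_b dominates the initial vector
print(conflict([after_b, after_a]))    # neither vector dominates the other
```

With this representation the two independent updates of the example are reported as a conflict, while a vector and its own ancestor are reported as compatible.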

An execution graph G = G(T1,...,Tn) is a DAG
with nodes C0,L1,C1,...,Ln,Cn,Ln+1 where Li is the lock and
Ci is the commit operation of transaction Ti respectively. C0
initializes all files and Ln+1 reads all files.⁷ The edges of
G are pairs (x,y) where either x=Li and y=Ci or y reads what x
writes.


b. Description of the Approach 


In the case of multiple file conflicts, version
vectors alone are not sufficient to detect conflicts, thus
an additional mechanism is required in order to achieve this
goal. Conceptually, non-serializability can be detected by
means of a precedence graph. A precedence graph is composed
of an execution graph of a schedule of operations and all
edges formed by operations with intersecting read and write
sets. If a precedence graph is acyclic then the execution
graph within it is serializable.


⁶A partial order is a binary relation which is reflexive,
antisymmetric and transitive.

⁷These are dummy transactions used to give symmetry to
the graph.

A set of files S is put into conflict if there
exists a schedule of transactions T1,...,Tn whose execution
graph is not serializable and one or more files in S
are also in the readset of any of the transactions of the
schedule. If S is put into conflict then the version vector
sequences for the sets S1,...,Sn will be incompatible. Note
that the sets S1,...,Sn are the readsets of the schedule of
transactions T1,...,Tn.

With these concepts in mind, if we want to detect
file conflicts for f, we must check all transaction sets of
files S containing f for serializability errors. A way to
do this is to have a log where all the readsets of the
transactions that have been executed are stored. An operation
called extent(f) is defined to obtain the set of files
that are involved with f by some of the readsets stored in
the log. In mathematical notation:

extent(f) = { g | (f,g) is in R+ }
where R+ is the transitive closure of the relation R
and R = { (f1,f2) | there is an S in the log such that
{f1,f2} is a subset of S }.

In plain words, extent(f) is the set of files
that in one way or another are related to f in the readsets
of transactions stored in the log. For example if the log
contains:

Read set (T1) = { f, f3, f4 }
Read set (T2) = { f3, f4, f5 }
Read set (T3) = { f1, f2 }
Read set (T4) = { f3, f6 }

then extent(f) = { f, f3, f4, f5, f6 }. This is because
the transitive closure implies that since f3 and f4 were
related with f in Readset (T1) then any other file related
with f3 and f4 is going to be also related with f and so on.

Two important consequences of the extent definition
are that a file is put into conflict if and only if its
extent is put into conflict and that extent divides the set
of all files into equivalence classes. In the example above
note that the extent of f5 is { f, f3, f4, f5, f6 } and thus
extent(f) = extent(f5).
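
The equivalence classes induced by extent can be computed with a union-find structure, as the following minimal Python sketch suggests. The class UnionFind and the helpers build_classes and extent are hypothetical names introduced here for illustration; the log is assumed to be simply a list of readsets.

```python
class UnionFind:
    """Disjoint-set forest over file names (illustrative helper)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[rx] = ry


def build_classes(log):
    """Merge all files that appear together in some readset of the log."""
    uf = UnionFind()
    for readset in log:
        files = sorted(readset)
        for f in files:
            uf.find(f)                 # register every file
        for f in files[1:]:
            uf.union(files[0], f)      # same readset => same class
    return uf


def extent(uf, f):
    """All files in the same equivalence class as f."""
    root = uf.find(f)
    return {g for g in list(uf.parent) if uf.find(g) == root}


log = [{"f", "f3", "f4"}, {"f3", "f4", "f5"}, {"f1", "f2"}, {"f3", "f6"}]
uf = build_classes(log)
print(sorted(extent(uf, "f")))
```

Applied to the four readsets of the example, the sketch places f, f3, f4, f5 and f6 in one class and f1, f2 in another, so that extent(f) and extent(f5) coincide.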

Stored values of the equivalence classes and
their version vectors are all that is needed in order to
detect multiple file conflicts. The stored set of classes
is called a log filter. The algorithm presented for
multi-file conflict detection is as follows (LF represents the
log filter):


(1) LF = null

(2) Repeat steps (3) to (4) each time a transaction
commits

(3) If the readset S of the transaction is contained in
some set S' in the LF then attach the version vector
sequence corresponding to the files in the readset of
the transaction with null vectors as place holders

(4) If S is not already contained in LF, incorporate S
and its corresponding version vector sequence to LF
using the fast union-find algorithm

(5) To check if a file is in conflict get the extent of the
file in the log filter and see if it has incompatible
version vector sequences. If it has, then return
conflict.

Note that instead of keeping a list of sequences
of version vectors for every update made in the system, log
filters are used to reduce the number of sequences of
version vectors the system needs to store as log
information. That is, in order to detect conflicts it is only
necessary to store those sequences which are not dominated by
any other sequence.

The conflict resolution policy presented by
Parker and Ramos is based on the notion of a transaction.
Any file update operation must be within the transaction
(between the begin and end statements). A get statement is
defined which informs the system about which files the user
plans to use. This get statement will check if all the
files given to it are consistent. If there exists a file in
conflict then the transaction is not executed. Each file
specified in the get statement is locally locked for the
duration of the transaction. When the transaction ends the
system updates the log filter and the updates are committed
simultaneously. If a file is found to conflict at this
moment then the transaction should be rolled back. The
transaction completes and all its locks are released.

The proposed approach might be seen as similar to
"optimistic" concurrency control [4], [14] where
conflicts are detected during and/or after the transaction's
execution. It could be used for partition handling for
these concurrency control mechanisms as follows. When
working in partitioned mode, the users are notified of
file conflicts whenever a transaction is started and the
partition is being merged or is already merged. Once a file
conflict is detected no updates are performed on that file
until it is reconciled.


This approach was developed by Faissol [8] and is by
far the most complete presentation of a method to deal with
the network partition problem. The approach is based on the
use of semantic knowledge about the applications in order to
allow updates in independent partitions. Database operations
are divided into classes of semantics in order to
reduce the amount of semantic information that must be
supplied to the DBMS. Each class shares a common merge
algorithm and information gathering routines.

a. Semantics of Operations and Database Reconciliation


Semantic information is supplied to the DBMS in
three forms:

(1) The class of semantics of each operation.

(2) A set of integrity assertions for each operation.

(3) The program code for each operation.

With this information, the DBMS will appropriately modify
the behavior of the user operations under a network
partition in a way that guarantees that reconciliation can be
made automatically upon partition merge. In this approach
the applications programmer will be in charge of extending
the requirements for semantic integrity in a way that
allows partitioned mode operation. Semantic integrity is
provided by a mix of integrity assertions and strong data
types. Integrity assertions are used mainly for those
constraints that may vary with the occurrence of network
partitions. Two sets of these assertions are specified, one
for normal operation and one for partitioned mode. They
will be automatically enforced by the DBMS depending on the
status of the system. Only when irrecoverable external
actions are involved is it necessary to restrict the user's
actions by having stricter integrity assertions.

Operations are divided into five classes of
semantics depending on their properties. The first class of
semantics (class A) involves operations that replace a
value and update single objects, that is, those operations
that update an object without examining the database and
have no associated integrity assertions. Examples of
these operations are updates of names and addresses. The
second class of semantics (class B) involves operations that
are compressible, commutative, update single objects and
have partitionable integrity assertions. A set of operations
on an account is compressible since we can replace
several credits by a single credit that is equivalent to all
of them. Two operations are commutative if the order in which
they are executed can be changed producing an equivalent
schedule. For example credit and debit operations on an
account are commutative. An integrity assertion is defined
as partitionable if we can derive from it a set of integrity
assertions, one for each partition, such that if each
assertion is satisfied in its respective partition then the
original assertion is satisfied at partition merge. The
third class of semantics (class C) involves operations that
either are commutative and invertible or are commutative and
have partitionable integrity assertions. An operation is
invertible if there exists another operation which will
restore the database to the initial state, that is, to the
value it had before the execution of the first operation.
For example the debit operation can be inverted if it is
followed by a credit for the same amount. Note that if
operation Oi is invertible by operation Oj and is commutative
with all operations in a schedule then it is invertible
even if there exist some operations between Oi and Oj in the
schedule. The fourth class of semantics (class D) involves
operations that are invertible. Finally, the fifth class of
semantics (class E) involves operations that do not contain
irrecoverable external actions.

As we have seen these semantic classes go from
the simplest operations to the most complex. An important
restriction that must be mentioned is that the
invertibility property of operations implies that no
irrecoverable external action may be allowed. In order to store
information necessary to the reconciliation algorithm a
history type is defined for each class of semantics. The
set of all history objects created in one partition is
defined as the partition history. A partition history will
in general contain objects of various classes. It is
created when the partition occurs and it is deleted when all
merges are complete. The first three history types are
stored as a set, that is, no ordering is defined. The last
two history types are stored as sequences. Figure 3.1
summarizes the class of semantics of each operation and the
information necessary to store in each partition history.

Figure 3.1 Semantic classes and histories


As we mentioned before, operations with class A
semantics involve the lowest overhead and are the most
restricted. Objects can be reconciled simply by choosing
the value in one partition and installing it in all others.
Note that only the last modification of each object is
required to be stored in the partition history for class A
operations. Operations of class B also have little overhead.
Reconciliation of objects can be made independently
for each object by a single operation that summarizes all
the updates made in the other partition. Since it is not
required that they be invertible they may contain
irrecoverable external actions. Operations with class C semantics
can modify several objects at a time and therefore
reconciliation of the database is made on an operation by
operation basis. At partition merge, each operation that
ran in one partition is executed in all the others.
Operations can be executed in any order since they are
commutative. The subclass of semantics C1 allows
irrecoverable external actions but operations in subclass C2 are
not allowed to execute this kind of action since they are
invertible. If some integrity assertion is violated by an
operation of class C2 then some operations are inverted until
a consistent database is obtained. Operations with class D
semantics are more complex since they must be executed in
order in all the other partitions at partition merge. In
this case conflicts may arise because of integrity assertion
violations or because operations that involve the same
data-objects were executed in different orders in each
partition. To reconcile the database conflicting operations
must be inverted, taking care of inverting also operations
that read values produced by inverted operations and then
reexecuting these operations in all partitions. Clearly the
partition merge algorithm is more complex. Operations with
class E semantics include all operations except those with
irrecoverable external actions combined with
nonpartitionable integrity assertions. To reconcile the database
modified by operations in this class it is necessary to undo
these operations by restoring the "before" images of all
modified objects taking the same precautions as with
operations of class D semantics.
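
As an illustration of the class B case discussed above (compressible, commutative updates to single objects), the following minimal Python sketch summarizes each partition's updates into one net operation per object and applies the summaries at merge. The function names compress and merge_class_b, and the representation of a partition history as a list of (account, amount) pairs, are assumptions made only for this example and are not taken from [8].

```python
def compress(history):
    """Replace all credits/debits on each account by one equivalent
    net amount (positive = credit, negative = debit)."""
    net = {}
    for account, amount in history:
        net[account] = net.get(account, 0) + amount
    return net


def merge_class_b(balances_before_partition, history_p1, history_p2):
    """Apply the summary of each partition to the pre-partition state.
    Because the operations commute, the order of the summaries is irrelevant."""
    merged = dict(balances_before_partition)
    for summary in (compress(history_p1), compress(history_p2)):
        for account, delta in summary.items():
            merged[account] = merged.get(account, 0) + delta
    return merged


before = {"acct": 100}
partition_1 = [("acct", 50), ("acct", -20)]   # credit 50, debit 20
partition_2 = [("acct", -30)]                 # debit 30
print(merge_class_b(before, partition_1, partition_2))
```

The sketch deliberately ignores integrity assertions; in the approach described above a partitionable assertion (for instance a per-partition limit on debits) would have to be checked inside each partition before an update is accepted.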

It is important to be aware that there exist
some types of objects, called "critical types" by Faissol,
which cannot be handled by this approach to partitioned mode
operation, for example a bank stop payment order. Failure
to handle this kind of case automatically does not
invalidate the method since such cases are infrequent enough to
be handled by extraordinary means (i.e. by telephone).


b. System Operation


This approach assumes that a concurrency control
mechanism exists in each partition to handle concurrent
execution of transactions. Also it is assumed that a
recovery mechanism removes the effects of a system crash
from the database. System and applications software are not
directly available to the users of the database system, who
interact through a set of pre-defined transactions. System
operations are added to those supplied by the application in
order to enforce semantic integrity and to allow partitioned
mode operation. When the entire network is connected the
system is in normal operation and all the copies of the
replicated database are mutually consistent. Every time a
transaction is submitted, a STATUS operation checks a
PART-FLAG object which is a system defined type and contains
information about the state of the network. If the result
of the check is "network connected", then the user operation
is executed. It is followed by a check operation on each of
the integrity assertions. If all assertions are satisfied
then the irrecoverable external actions are started (if they
exist) and the transaction terminates. If the assertions
are not satisfied then the transaction is aborted. Note
that in normal operation, the only additional overhead is
the status operation because the check of integrity
assertions is always required to maintain semantic integrity.
If status returns "partition merge in progress" the DBMS
must check if the operation can be executed. This depends on
the class of semantics to which the operation belongs. For
example, operations with class A semantics have to check
that the target object is not locked by the merge algorithm,
while operations of class E semantics have to check that
their read and write sets do not intersect with the read and
write sets of the remaining operations in each partition
history, in order to be executed.
If status returns "network partitioned" the appropriate
information is stored in the partition history for the
class of semantics of the operation. If the operation is
not within one class of semantics allowed to run in
partitioned mode then it is rejected, otherwise the operation is
executed and a check of integrity assertions is performed.
When two or more partitions merge, a system process performs
database reconciliation using the information stored in the
partition history for each class of semantics. A different
merge algorithm is invoked for each class of semantics.


A. INTRODUCTION


In chapter 3 we reviewed previous work done on network
partitioning. We were particularly interested in analyzing
two recently proposed methods by Parker [17] and Faissol [8]
because they allow non-stop operation of each partition and
reconcile the conflicts at partition merge time.

Since our objective is to attain high availability of
data in the distributed database we will concern ourselves
in this chapter with the development of an alternate
approach that will also allow continuous operation of the
partitions during a network partition.

The approach proposed in this chapter relies on
precedence graphs in order to detect conflicts [20] and on
serializability as the correctness criterion for database
reconciliation.

In order to make the basic concepts more understandable the
discussion that follows assumes that there are only two
partitions and that during partition merge all the
operations on the database are suspended. In later sections we
relax these constraints and present some extensions to the
approach.


B. DESCRIPTION OF THE APPROACH


1. Definitions and Assumptions


It is assumed that within one partition there is a
mechanism to provide concurrency control and atomic
transactions. A number of such mechanisms have been described in
the literature (see, e.g., [18]). Therefore we
assume that the system operates as if only one transaction
is executed at a time and that rejected transactions have no
effect on the database. It is also assumed that if a system
crash occurs in the middle of a transaction, the recovery
mechanism will remove its effects from the database.

For the rest of the chapter transactions will have
the following structure:

(1) A transaction T wishing only to read a logical data
object X executes a Read-Lock X, which prevents any
other transaction from writing a new value of X while
T is reading. However, any number of transactions
can hold a read-lock on X at the same time.

(2) A transaction wishing to change the value of logical
data object X first obtains a write-lock for X and
no other transaction can obtain either a read or
write-lock on the object.

(3) Messages are sent to all sites holding physical
copies of data-object X notifying them to change
their copies to reflect the new modification before
releasing the write-lock.

(4) The transaction commits and only then all its locks
are released (thus we assume two-phase locking to
assure serializable execution).

Definition 4.1: The logical data objects⁸ that a transaction
reads are its readset. The logical data objects that a
transaction writes are called the transaction's
writeset. They will be represented as readset(T) and
writeset(T) respectively.

Note that in particular we do not assume that the
writeset of a transaction is always a subset of the readset.
This allows a more realistic model which admits the
possibility that a transaction reads a set of objects (the
readset) and writes a set of objects (the writeset), with
the option that an object X could appear in either one of
these sets or both. For example in the transaction:

READ X; READ Y; Z = X * Y; X = Y; write Z; write X

the readset is {X, Y} and the writeset is {X, Z}.

Definition 4.2: A precedence graph G(V,E) is a directed
graph, where the vertices (V) correspond to the set of
transactions T1,...,Tk within a schedule S, and the
edges (E) represent precedence relations between the
transactions.


⁸See Chapter 2, section B.

Definition 4.3: A schedule S for a set of transactions
T1...Tk is serializable if its precedence graph is
acyclic.

Proposition 4.1: Two transactions Ti and Tj are commutative
if:
1) READSET(Ti) is disjoint from WRITESET(Tj) and
2) WRITESET(Ti) is disjoint from READSET(Tj) and
3) WRITESET(Ti) is disjoint from WRITESET(Tj)

Proof Outline: The only way in which transaction Ti may 
affect the outcome of Tj is by modifying shared objects
in the database and vice versa. Since only the readsets
are allowed to intersect and read operations are commu- 
tative (the order in which transactions read a shared 
object is unimportant) there is no real interaction 
between the transactions. Therefore changing the order 
of execution produces an equivalent schedule, which 
implies commutativity. 
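
The three conditions of Proposition 4.1 translate directly into a set test. The following minimal Python sketch assumes a hypothetical Txn record holding a transaction identifier together with its readset and writeset; the names Txn and commutative are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Illustrative transaction record: identifier, readset and writeset."""
    tid: str
    readset: frozenset
    writeset: frozenset


def commutative(ti, tj):
    """Conditions 1) to 3) of Proposition 4.1: only the readsets may intersect."""
    return (ti.readset.isdisjoint(tj.writeset)
            and ti.writeset.isdisjoint(tj.readset)
            and ti.writeset.isdisjoint(tj.writeset))


t1 = Txn("T1", frozenset({"X", "Y"}), frozenset({"Z"}))
t2 = Txn("T2", frozenset({"X"}), frozenset({"W"}))   # shares only a read of X with T1
t3 = Txn("T3", frozenset({"Z"}), frozenset({"X"}))   # writes X, which T1 reads

print(commutative(t1, t2))
print(commutative(t1, t3))
```

Note that the conditions are sufficient but conservative: two transactions that fail the test may still happen to commute for particular values, which the set-based test cannot see.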

Definition 4.4: Within one partition schedule we define a 
transaction Ti to be a descendant of transaction Tj if 
READSET (Ti) intersects WRITESET (Tj). 

Definition 4.5: The relatives of a transaction T are the set
of all transactions that functionally depend on T (i.e.
the set of all descendants).

In order to verify the correctness of the approach
given in the next section we need a formal definition of
correct partitioned mode operation. We will adopt the
definition given by Faissol [8].

Definition 4.6: Let S0 be a schedule composed of
transactions T1...Tk, such that some transactions in S0
were successfully applied in partition 1 and the rest in
partition 2, resulting respectively in the schedules S1
and S2; then correct partitioned mode operation is
attained if the following conditions are satisfied:

(1) With the information stored in each partition it is
possible to construct schedules S3 and S4 such that
schedules S5 and S6 are both equivalent to the same
serial execution of S0, where: S5 = (S1, S3) and
S6 = (S2, S4).

(2) No transactions containing irrecoverable external
actions are reversed by the partition merge
algorithm.

(3) All integrity assertions are satisfied after the
partition merge algorithm is executed.

Note in particular that only a schedule equivalent
to some serial execution of S0 is required and not a
schedule equivalent to S0. This may cause different results
than would have occurred if the network was connected, but
this is usually accepted if serializability is the
correctness criterion.

Also it is important to note that since some
transactions that would have been executed with the network
connected must be rejected in partitioned operation, S0 was
defined as the schedule of transactions successfully
executed in partitioned mode (this does not mean that all
transactions are committed since some of them may be aborted
after execution because of violation of some integrity
assertion).


2. Conflict Detection and DataBase Reconciliation 


Our approach to continuous operation under a network
partition is based on the use of precedence graphs to detect
conflicts between partitions at merge time and to help to
determine a serializable schedule equivalent to some serial
execution of the global schedule S0 defined in the last
section. When a network partition occurs the DBMS within
each partition performs two actions: first, it activates a
mechanism that aborts transactions trying to execute an
irrecoverable external action and second, it creates a
partition-log which stores information necessary for the
reconciliation algorithm. The information contained in the
partition-log consists of the transaction-ID, the read and
write sets of the transaction and the old and new values of
the updated objects (those in the write set). The
transactions are recorded in the order in which they commit⁹
within the partition, that is, as a sequence (a total order).
When communication between partitions is reestablished no more
transactions are allowed to be processed until partition
merge is completed (this restriction is relaxed in section
C). The partition merge algorithm is then started to
reconcile the databases.


⁹Note that Tn can execute in a partition only if there
exist copies within the partition for every data-object in
its read and write sets.

Initially the partition reconciliation algorithm
will construct for each partition a precedence graph from
the information contained in the respective partition-log.
The precedence graph is constructed as follows:

(1) If transaction Ti reads data-object X, and Tj is the
next transaction (if it exists) to write X then
construct an edge from Ti to Tj.

(2) If transaction Ti writes data-object X and Tj is the
next transaction to write X then construct an edge
from Ti to Tj.

(3) If transaction Ti writes data-object X and Tj reads X
before any other transaction writes X then construct
an edge from Ti to Tj. Mark this edge as a descendant
edge.
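
The three construction rules can be applied in a single pass over a partition-log, which also shows why the graph can be built gradually as entries are appended. The following Python sketch is only an illustration under stated assumptions: the log is taken to be a list of (transaction-ID, readset, writeset) triples in commit order, and the function name build_partition_graph and the edge labels "precedence" and "descendant" are ours.

```python
def build_partition_graph(log):
    """Return {(Ti, Tj): kind} for one partition-log given in commit order.

    kind is "descendant" for rule (3) edges and "precedence" otherwise.
    """
    edges = {}
    last_writer = {}   # object -> last transaction that wrote it
    readers = {}       # object -> transactions that read it since that write
    for tid, readset, writeset in log:
        for x in readset:
            w = last_writer.get(x)
            if w is not None and w != tid:
                edges[(w, tid)] = "descendant"             # rule (3)
            readers.setdefault(x, set()).add(tid)
        for x in writeset:
            for r in readers.get(x, ()):
                if r != tid:
                    edges.setdefault((r, tid), "precedence")   # rule (1)
            w = last_writer.get(x)
            if w is not None and w != tid:
                edges.setdefault((w, tid), "precedence")       # rule (2)
            last_writer[x] = tid
            readers[x] = set()
    return edges


log = [
    ("T1", {"X"}, {"X"}),
    ("T2", {"X"}, {"Y"}),
    ("T3", set(), {"X"}),
]
for edge, kind in sorted(build_partition_graph(log).items()):
    print(edge, kind)
```

When the same pair of transactions is related by more than one rule, the sketch keeps the descendant marking, since that is the information needed later to compute relatives.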

It is important to note that the partition
precedence graph does not have to be constructed at partition
merge time, but can be constructed gradually as new entries
are added to the partition-log. In fact, it is better to do
it this way since at network reconnection the precedence
graph will be almost complete and partition merge time is
reduced. Also note that each partition precedence graph is
going to be acyclic since the schedules stored in the
partition-logs are serializable (they have already been
executed).

The next step is to construct a global precedence
graph which is going to consist of each partition's
precedence graph plus conflict edges between partitions. Since
transactions were allowed to run "in parallel" in their
respective partitions without coordination between them, the
conflict edges represent the interaction among transactions
from different partitions. Therefore, a transaction that
reads a data-object in one partition must precede any
transaction that writes that data-object in the other partition
to maintain consistency. So a conflict edge from Ti to Tj is
constructed if transaction Ti in one partition reads or
writes data-object X and transaction Tj in the other
partition writes X.
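
The conflict-edge rule stated in this paragraph can be sketched as follows. As before, each partition-log is assumed to be a list of (transaction-ID, readset, writeset) triples, and the function name conflict_edges is hypothetical.

```python
def conflict_edges(log1, log2):
    """Edges (Ti, Tj) where Ti reads or writes some X in one partition
    and Tj writes X in the other partition."""
    edges = set()
    for here, there in ((log1, log2), (log2, log1)):
        for ti, readset_i, writeset_i in here:
            touched = set(readset_i) | set(writeset_i)
            for tj, _readset_j, writeset_j in there:
                if touched & set(writeset_j):
                    edges.add((ti, tj))
    return edges


partition_1 = [("T1", {"X"}, {"Y"})]
partition_2 = [("T2", {"Y"}, {"X"})]
print(sorted(conflict_edges(partition_1, partition_2)))
```

In this small example each transaction reads what the other one writes, so the two conflict edges point in opposite directions and form a cycle in the global precedence graph. The same happens whenever both partitions write the same data-object.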

Once the global precedence graph is constructed a
topological sort is executed on the graph and if a cycle is
found, one of the transactions involved in the cycle (the
one with the fewest descendants) and all of its descendants
are rolled back in the partition where they were executed. The
entry in the partition-log corresponding to each rolled-back
transaction is sent to a re-execution list. If a node can
be extracted by the topological sort then the values of the
objects updated by the transaction represented by the node
are forwarded¹⁰ to the other partition and the corresponding
entry in the partition-log is deleted. The process is
repeated until all transactions in the precedence graph have
been forwarded to the other partition or sent to the
re-execution list, that is, until there are no more entries in
the partition-logs.

The transactions in the re-execution list are then
executed in both partitions and if any violation of
integrity assertions occurs the transaction is rolled back and
its entry in the re-execution list is deleted. This can
happen because we have altered the order in which
non-commutative transactions were executed. When the algorithm
terminates we are going to have a consistent database
throughout the network.

After this brief discussion of the approach taken we are now
ready to present the merge algorithm.

Algorithm MERGE

(1) Send message "partition merge in progress" to each
partition.

(2) Construct the precedence graph for each partition
extracting information from their respective
partition-log.

(3) Repeat steps (4) to (5) for each partition.

(4) Repeat step (5) for each entry in the partition-log
starting at the first one.

(5) Compare the readset of this entry with the read and
write sets of every entry in the partition-log of the
other partition. Any time a match is found add a
directed edge (from the transaction in this entry to
the transaction in the other partition's entry) to the
global precedence graph. Mark this edge as a conflict
edge.

(6) Run the TOPOLOGICAL-EXEC algorithm on the global
precedence graph until all nodes on the graph have been
deleted.

(7) Execute algorithm RE.

(8) Send message "merge completed" to each partition.

(9) Terminate.


¹⁰Actually, only the updated values of copies of
data-objects that also exist in the other partition are
forwarded. We will refer to this every time the word
forwarded is used.

At the end of the merge algorithm the global precedence
graph and all entries in both partition-logs will have been
deleted. We now present the supporting algorithms
TOPOLOGICAL-EXEC and RE. What the TOPOLOGICAL-EXEC
algorithm basically does is a topological sort on the global
precedence graph to obtain a serial schedule for the
transactions in both partitions. A topological sort generates a
linear ordering with the property that if Ti is a
predecessor of Tj in the graph then Ti precedes Tj in the linear
order. A linear order with this property is called a
topological order [13].

Since a linear order is serial by nature a
topological order gives a serial order which satisfies the
precedence relations between transactions.

It is important to note, however, that a topological
order can be obtained only if the global precedence graph is
acyclic and thus if there is a cycle it must be removed from
the graph before the topological sort can continue.
Algorithm TOPOLOGICAL-EXEC uses algorithm REMOVE-CYCLE in
order to remove one of the edges that form the cycle to
obtain an acyclic graph. Every time the TOPOLOGICAL-EXEC
algorithm is able to extract a node from the graph it
forwards the updates made by the transaction (contained in
the partition-log entry) that corresponds to the graph node
to the other partition. We now present the algorithm.
Algorithm TOPOLOGICAL-EXEC

(1) Repeat steps (2) to (6) for each node in the global
precedence graph.

(2) If every node has a predecessor then execute algorithm
REMOVE-CYCLE and go to (1).

(3) Pick a node which has no predecessors.

(4) Forward the updated values of the data-objects
modified by the transaction (contained in the
partition-log entry) that corresponds to the selected
node to the other partition.

(5) Delete the entry from the respective partition-log.

(6) Delete the node and all edges leading out of the node
from the global precedence graph.

(7) Terminate.
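
The control flow of TOPOLOGICAL-EXEC can be sketched in Python as a topological sort that calls a cycle-removal routine whenever no node is free of predecessors. This is a minimal sketch under stated assumptions: forwarding is modelled by appending the node to a list, the cycle-removal policy is passed in as a function that returns the set of nodes to roll back, and the names topological_exec and pick_smallest are hypothetical. The placeholder policy pick_smallest is not the REMOVE-CYCLE policy of this chapter.

```python
def topological_exec(nodes, edges, remove_cycle):
    """Return (forwarded, re_execution) lists for a global precedence graph.

    nodes: iterable of transaction ids; edges: set of (u, v) pairs;
    remove_cycle(nodes, edges) must return a non-empty set of nodes to roll back.
    """
    nodes, edges = set(nodes), set(edges)
    forwarded, re_execution = [], []
    while nodes:
        has_pred = {v for (_u, v) in edges}
        free = sorted(n for n in nodes if n not in has_pred)
        if not free:
            # Step (2): every node has a predecessor, so a cycle exists.
            victims = set(remove_cycle(nodes, edges)) & nodes
            if not victims:
                raise ValueError("remove_cycle must select at least one node")
            re_execution.extend(sorted(victims))
            nodes -= victims
            edges = {(u, v) for (u, v) in edges
                     if u not in victims and v not in victims}
            continue
        n = free[0]                       # step (3)
        forwarded.append(n)               # steps (4) and (5), modelled as a list
        nodes.remove(n)                   # step (6)
        edges = {(u, v) for (u, v) in edges if u != n}
    return forwarded, re_execution


def pick_smallest(nodes, edges):
    # Placeholder policy (assumption): roll back one arbitrary node only.
    return {min(nodes)}


print(topological_exec({"T1", "T2", "T3"},
                       {("T1", "T2"), ("T2", "T1"), ("T2", "T3")},
                       pick_smallest))
```

Keeping the victim-selection policy as a parameter separates the sort itself from the choice of which transactions to roll back.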

As we indicated before, any time a cycle is found
the algorithm REMOVE-CYCLE is invoked to remove one of the
nodes involved in the cycle. This means that the effects of
the transaction contained in the entry that corresponds to
the node and the effects of all the relatives of the
transaction must be removed from the database. In order to avoid
extensive roll-back of transactions as much as possible the
transaction chosen to be removed will be the one with the
fewest relatives. The transactions will be rolled back in
inverse order of execution and their entries in the
partition-log will be moved to a re-execution list to be
executed again later. We now present the algorithm.

Algorithm REMOVE-CYCLE 

(1) Repeat step (2) for each node related to another by a
conflict edge in the precedence graph.

(2) Compute the number of relatives of the node by
counting the descendant edges that go out either of
the node or its descendants.

(3) Choose the node with the smallest number of relatives and
create a relative set containing all the relatives of
the node. If there is more than one node with the same
number then choose the one involved with more conflict
edges.

(4) Move the partition-log entry corresponding to the
selected node to a roll-back list and repeat step (5)
for each following entry until the relative set is
empty.

(5) If the entry corresponds to a node in the relative set
then move the entry to the roll-back list and delete
the node from the set.

(6) Repeat steps (7) to (9) for each entry in the
roll-back list starting with the last entry and going
backwards until the list is empty.

(7) Use the system supplied UNDO operation to remove the
effects of the transaction corresponding to the entry
by placing the "before" values of the updated objects
in their corresponding partitions.

(8) Move the entry in the roll-back list to the
re-execution list.

(9) Delete the node that corresponds to the entry and all
edges to or from the node in the global graph.

(10) Terminate. 
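
A possible rendering of the selection and roll-back ordering in steps
(1) to (6) is sketched below. It rests on assumptions of our own:
the relatives of a node are taken to be the nodes reachable from it
along the edges passed in succ, conflict edges are supplied
separately as (from, to) pairs, and the names descendants,
choose_victim and rollback_order are hypothetical. The UNDO
operation itself is system supplied and is not modelled.

```python
from typing import Dict, List, Sequence, Set, Tuple


def descendants(succ: Dict[str, Set[str]], start: str) -> Set[str]:
    """Nodes reachable from `start` along `succ`, excluding `start`."""
    seen: Set[str] = set()
    stack = list(succ.get(start, ()))
    while stack:
        node = stack.pop()
        if node != start and node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen


def choose_victim(
    succ: Dict[str, Set[str]],
    conflict_edges: Sequence[Tuple[str, str]],
) -> Tuple[str, Set[str]]:
    """Steps (1)-(3): fewest relatives, ties broken by more conflict edges."""
    candidates = {n for edge in conflict_edges for n in edge}

    def key(node: str) -> Tuple[int, int, str]:
        incident = sum(node in edge for edge in conflict_edges)
        return (len(descendants(succ, node)), -incident, node)

    victim = min(candidates, key=key)
    return victim, descendants(succ, victim)


def rollback_order(log: List[str], victim: str, relatives: Set[str]) -> List[str]:
    """Steps (4)-(6): affected log entries, last executed first."""
    doomed = {victim} | relatives
    return [txn for txn in reversed(log) if txn in doomed]
```

The final component of the sort key (the node name) only makes the
choice deterministic when both criteria tie; the algorithm itself
does not say what to do in that case.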

At the end of algorithm TOPOLOGICAL-EXEC there is a
re-execution list which contains all the transactions from
both partitions that were rolled back in order to maintain a
globally consistent database state. These transactions are to
be rerun in both partitions by the algorithm RE. In this
case integrity violations can occur since we have changed
the execution order of transactions that are noncommutative.
Note that when algorithm TOPOLOGICAL-EXEC terminates
the databases of each partition are in a consistent state,
that is, they maintain mutual consistency since the
same transactions have been executed in both partitions. We
now present algorithm RE.


Algorithm RE

(1) Repeat steps (2) to (5) for each entry in the
re-execution list.
(2) Run the specified transaction in both partitions. 
(3) If any integrity assertion is violated reject the 
transaction. 
(4) Delete the current re-execution list entry. 
(5) Terminate. 
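
As a minimal sketch of algorithm RE, assume a hypothetical callable
run that executes a transaction in both partitions and returns False
when an integrity assertion is violated. The function name
re_execute is likewise our own.

```python
from typing import Callable, List


def re_execute(reexec_list: List[str], run: Callable[[str], bool]) -> List[str]:
    """Rerun every entry of the re-execution list, in list order.

    Returns the transactions rejected because `run` reported an
    integrity violation. The list is emptied as entries are processed.
    """
    rejected: List[str] = []
    while reexec_list:
        txn = reexec_list[0]
        if not run(txn):          # integrity assertion violated: reject
            rejected.append(txn)
        del reexec_list[0]        # delete the current entry
    return rejected
```
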
We proceed now to show the correctness of the approach.

Proposition 4.2: Algorithm MERGE correctly reconciles a
database that has been independently modified by
transactions in different partitions.

Proof: Let S0 be the schedule in the whole system with S1
and S2 executed in partition 1 (PR1) and partition 2
(PR2) respectively. We must prove that each of the
requirements for correctness defined in section B,
subsection 1 is satisfied when algorithm MERGE is
executed. To make the proof easier to follow we consider
three cases according to the initial configuration of the
global precedence graph constructed by steps (1) to (5) of
algorithm MERGE.

Case 1: Global precedence graph with no conflict
edges. This means that all transactions in one partition
are commutative with all transactions in the other
partition. Step (6) executes algorithm TOPOLOGICAL-EXEC,
which will obtain a topological order of transactions
from both partitions and will forward the values of
objects updated by transactions in one partition to the
other. In this case the resulting schedules of
transactions executed in PR1 and PR2 will be equivalent to
the global schedule S0 and not only to a serial execution
of it. This is due to the fact that transactions in PR1
are commutative (see the definition of commutativity in
section B) with transactions in PR2 and vice versa, and
they can be executed in any order without affecting their
results, provided that the order in which they were
executed in their own partition is preserved.

Case 2: Global precedence graph with conflict edges
but without cycles. In this case the graph is also
serializable. A conflict edge represents the fact that,
for the same logical data-object with physical copies
in different partitions, a transaction in one partition
read the value of a copy of this data-object while in the
other partition a transaction updated the value of the
copies of the data-object. Algorithm TOPOLOGICAL-EXEC by
step (3) will make sure that the transactions which read
an object forward their updated objects before
transactions that write the object in the other partition.
The rest of the transactions are commutative with the
transactions in the other partition so they are no
problem, as in case 1. However, the resulting schedule
executed in PR1 and PR2 may not be equivalent to S0 but
only to a serial execution of it, namely, the one produced
by algorithm TOPOLOGICAL-EXEC.

Case 3: Global precedence graph with cycles. In this
case the graph is not serializable. Cycles must be
removed to obtain a serializable schedule. Step (2) of
algorithm TOPOLOGICAL-EXEC will detect cycles and
remove all offending transactions and their relatives
using algorithm REMOVE-CYCLE. Removed transactions are
sent to the re-execution list. Once the graph has no
cycles we are again in case 2. Values of objects
updated by transactions are forwarded to the
corresponding partition in topological order by step (4) of
algorithm TOPOLOGICAL-EXEC. Immediately before step (7)
of algorithm MERGE, the schedules in both partitions
are equivalent, with each transaction not removed from
the graph executed on both sides and transactions
removed from it in the re-execution list. Step (8) will
then execute algorithm RE, which reruns the
transactions that were removed in both partitions in
the same order, checking integrity assertions. If some
integrity violation occurs at this point, the
transactions are aborted in both partitions. At this stage
every transaction in the global schedule S0, except
those with integrity violations, has been executed in
both partitions. Thus, the resulting schedules are
equivalent to some serial execution of S0, namely, the
one produced by algorithm MERGE. Note that this
schedule will be composed of transactions executed in
topological order by algorithm TOPOLOGICAL-EXEC, plus
transactions rerun by algorithm RE, minus transactions
with integrity conflicts. Correctness condition (1) is
satisfied because in each of the three cases we are
going to have at least a schedule equivalent to some
serial execution of S0 in both partitions. Condition
(2) is satisfied because no irrecoverable external
actions are allowed. Condition (3) is also satisfied
because transactions that violate integrity assertions
are aborted.
As we can see the merge algorithm is somewhat
complex. This is due to our interest in allowing more
availability of data and to our concern in trying to avoid as
much transaction roll-back as possible.

3. An Example


In order to make clear how the approach works we
present an example in this subsection.

Let S1 = T11, T12, T13 be the schedule of transactions
executed in partition 1 (PR1) where:

Readset(T11) = x,y,z    ; Writeset(T11) = x,z
Readset(T12) = u,v,z,p  ; Writeset(T12) = v
Readset(T13) = p,q      ; Writeset(T13) = p,q

and S2 = T21, T22, T23, T24 the schedule executed in PR2 where:

Readset(T21) = q,u,r    ; Writeset(T21) = u,r
Readset(T22) = l,m,n    ; Writeset(T22) = l,m
Readset(T23) = u,w      ; Writeset(T23) = w
Readset(T24) = w,y,z    ; Writeset(T24) = y

When the partitions find out that they can communicate,
algorithm MERGE is started. Step (2) of this algorithm will
construct the precedence graph of each partition with the
information stored in their respective partition-log. Steps
(3), (4), (5) of the algorithm will construct the global
precedence graph by adding the conflict edges to the
existing graph. Figure 4.1 shows the global precedence graph
constructed.

Figure 4.1 Initial precedence graph

Once the global precedence graph is constructed, step
(6) will call algorithm TOPOLOGICAL-EXEC to obtain a
topological order of the nodes in the graph. Step (3) of
algorithm TOPOLOGICAL-EXEC will select the node corresponding
to T22 since it is the only node without a predecessor. Step
(4) of this algorithm will forward the value of the objects
updated by T22 (i.e. l,m) to partition 1 (PR1). Steps (5) and
(6) will delete the entry in the partition-log that
corresponds to that node and will delete the node from the
graph, respectively. Figure 4.2 shows the state of the graph
after the deletion.
Figure 4.2 First modification to precedence graph

Step (2) of algorithm TOPOLOGICAL-EXEC will determine
that all remaining nodes have a predecessor, so a cycle
exists. Algorithm REMOVE-CYCLE is then invoked and steps (1)
and (2) of this algorithm count the descendants of each node
related to another by conflict edges. T12 and T24 have the
same number of descendants (0 descendants) but T24 is
involved with two conflict edges, so step (3) of this
algorithm chooses T24 to be rolled back and creates an empty
set of descendants. Step (4) moves the partition-log entry
corresponding to the selected node to the roll-back list and
step (5) is skipped since the node has no relatives. Steps
(7), (8), (9) remove the effects of the transaction from the
database by using the UNDO operation, move the entry
corresponding to T24 to the re-execution list, and delete the
node and edges to or from it from the graph, respectively.
Figure 4.3 shows the new state of the precedence graph.

Figure 4.3 Second modification to the precedence graph


Algorithm TOPOLOGICAL-EXEC resumes execution and by step
(3) picks T11 since now it has no predecessors. Step (4)
forwards the values updated by T11 to PR2 and steps (5) and
(6) delete the entry corresponding to T11 and the node in
the graph, respectively. Figure 4.4 shows the remaining
graph.

Figure 4.4 Third modification to the precedence graph

Successive applications of steps (3), (4), (5) and (6) pick
T12, T21, T23 and T13, in that order (T23 and T13 could be
picked in any order); send the updates of the transactions
to the partition where they did not execute; and delete the
respective entries from the partition-log and nodes from the
graph. Figure 4.5 shows the next state of the precedence
graph.

Figure 4.5 Fourth modification to the precedence graph

Now algorithm MERGE resumes execution and by step
(7) it executes algorithm RE. The only entry in the
re-execution list was T24 so this transaction will be
executed in both partitions. If any integrity assertion is
violated the transaction will be aborted by step (4) of this
algorithm and by step (5) the entry is deleted from the
re-execution list.

Note that the schedule of transactions executed in
both partitions is now equivalent to a schedule executed in
topological order. This is due to the fact that sending the
updates made by any transaction in PR1 to PR2 is equivalent
to executing that transaction before any transaction in PR2
that writes to those objects, and vice versa. Thus the
equivalent topological order of execution in both partitions
will be T22, T11, T12, T21, T23, T13, and T24 if it is not
aborted.
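
The example can be checked mechanically with the sketch below. The
edge rules are assumptions of ours, since the figures are the
authority: within a partition an earlier transaction precedes a
later one when they touch a common object and at least one of them
writes it, and a conflict edge runs from a transaction that read an
object to a transaction in the other partition that wrote it. With
these rules the behaviour described in the text is reproduced. All
names in the code are hypothetical.

```python
from itertools import product

READS = {
    "T11": {"x", "y", "z"}, "T12": {"u", "v", "z", "p"}, "T13": {"p", "q"},
    "T21": {"q", "u", "r"}, "T22": {"l", "m", "n"}, "T23": {"u", "w"},
    "T24": {"w", "y", "z"},
}
WRITES = {
    "T11": {"x", "z"}, "T12": {"v"}, "T13": {"p", "q"},
    "T21": {"u", "r"}, "T22": {"l", "m"}, "T23": {"w"}, "T24": {"y"},
}
S1 = ["T11", "T12", "T13"]
S2 = ["T21", "T22", "T23", "T24"]


def local_edges(schedule):
    """Assumed rule: earlier -> later when they conflict on an object."""
    edges = set()
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if WRITES[a] & (READS[b] | WRITES[b]) or READS[a] & WRITES[b]:
                edges.add((a, b))
    return edges


def conflict_edges(s1, s2):
    """Assumed rule: reader -> writer in the other partition."""
    edges = set()
    for a, b in product(s1, s2):
        if READS[a] & WRITES[b]:
            edges.add((a, b))
        if READS[b] & WRITES[a]:
            edges.add((b, a))
    return edges


EDGES = local_edges(S1) | local_edges(S2) | conflict_edges(S1, S2)


def roots(nodes, edges):
    """Nodes of `nodes` with no predecessor inside `nodes`."""
    targets = {b for a, b in edges if a in nodes}
    return sorted(n for n in nodes if n not in targets)


def is_topological(order, edges):
    """True if every edge between listed nodes points forward in `order`."""
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in edges if a in pos and b in pos)
```

Calling roots on all seven transactions yields only T22; once T22 is
removed no root remains, which signals the cycle between T11 and
T24; and after T24 is also removed T11 becomes the only root.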


C. EXTENSIONS TO THE APPROACH 


Section B presented the approach we proposed to allow
the operation of distributed database systems under network
partitions. In order to simplify the description of the
system's operation and of the merge algorithm, we made a
number of restrictions and promised to relax them later.
This is the purpose of this section.

Subsection 1 presents a discussion of the locking
requirements and associated modifications to the merge
algorithms that will allow normal operation while a partition
merge is in progress. The main objective of this subsection
is to reduce the delay to which incoming transactions would be
subjected while the merge algorithm is in progress. This
delay can be substantial for partitions of long duration or
for database systems with high activity rates.

Subsection 2 presents a discussion of partial partition
merges when there are more than two partitions. Sites may
become partitioned and then join again in various orders.
The easiest solution would be to wait until the network is
completely reconnected to perform partition merges. This
would not only increase the degree of inconsistency that
must be reconciled later but would also increase the
overhead and time required for partition merge.

Finally, subsection 3 presents a discussion of the
situations in which irrecoverable external actions can be
allowed when the network is partitioned. The main objective
of this subsection is to increase the availability of the data
in the database to users, so that a higher number of
transactions can be executed during the partition.


1. Normal Operation During Partition Merge

The merge algorithm described in section B,
subsection 2 assumed that no transaction was allowed until the
algorithm was completed. This assumption relieved us from
worrying about interference from other transactions and
locking issues in the description of the algorithm.

This subsection presents the locking requirements
necessary to allow normal operation while the merge
algorithm is in progress. We will see that these requirements
are relatively simple, despite the fact that the algorithm
is somewhat complex.

Normal operation during partition merge can be
allowed if the new transactions do not interfere with
transactions being reconciled by the merge algorithm. That is,
we need the new transactions to be commutative with all the
remaining transactions in each partition-log in order to
execute them in normal mode. Otherwise new conflicts will
arise that could not be resolved.

To ensure that the new transaction is commutative we
need to compare its read and write sets with the read and
write sets of all transactions still in the partition-logs.
If no match is found then we know it is commutative and can
be executed without problem. However, if there is a match
then the transaction has objects in its read and write
sets that are yet to be reconciled, and so it must be delayed
to avoid new conflicts.
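
The comparison can be sketched as follows, assuming each
transaction is represented by a (read set, write set) pair and that
a "match" means any object in common, as the text above reads
literally. The name can_run_now is hypothetical.

```python
from typing import Iterable, Set, Tuple

Txn = Tuple[Set[str], Set[str]]  # (read set, write set)


def can_run_now(new: Txn, pending: Iterable[Txn]) -> bool:
    """True if `new` shares no object with any transaction still logged."""
    touched = new[0] | new[1]
    return all(touched.isdisjoint(reads | writes) for reads, writes in pending)
```

The cost of this check grows with the number of entries still in
the partition-logs, which is the overhead discussed next.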

Note that even if the new transaction is commutative
there may be a significant delay before it can be executed,
since we are introducing additional overhead in order to
compare its read and write sets with those of the
transactions that remain to be reconciled.

However, we can attempt an optimization if we use
the idea of a data-object log (DO log) [4] to store
information about the status of the object. The information we
need to store is just a "mark" that indicates that the
object was used by some transaction in partitioned mode. In
order to accomplish this we need to establish the policy
that, while in partitioned mode, the first time an
object is used by any transaction a DO log is
created and a value (i.e. 0) is stored in it. After this,
every time a transaction uses the object the value in the DO
log is incremented (i.e. by 1). This process stops when the
partition merge algorithm initiates its execution. A small
modification should be made to the MERGE algorithm. Every
time an entry is deleted from a partition-log or from the
re-execution list, the value stored in the DO log of each
object used by the transaction that corresponds to that
entry is decremented by 1 in the partition where the
transaction executed while in partitioned mode. If, when
deleting an entry, the value stored in the DO log is 0 then
the DO log is deleted.

In that way new transactions operating in normal
mode that want to use an object just have to check in
each partition whether the object has an associated DO log;
if so, the transaction is delayed until the DO log of the
object is deleted in every partition where it existed.

We can see that the overhead involved is much
smaller than the one we need to compare the read and write
sets with all of the transactions being reconciled and so
the delay should be considerably decreased. Also the
overhead imposed on the MERGE algorithm by this addition is
not very significant and the additional storage required is
rather small.
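
The DO log bookkeeping for one partition can be sketched as a
counter per object. This is a minimal illustration under our own
assumptions: the class name DOLog and its method names are
hypothetical, and a dictionary stands in for the per-object logs.

```python
from typing import Dict, Iterable


class DOLog:
    """Per-partition marks for objects used in partitioned mode."""

    def __init__(self) -> None:
        self._marks: Dict[str, int] = {}

    def record_use(self, objects: Iterable[str]) -> None:
        """Called for each transaction executed in partitioned mode."""
        for obj in set(objects):
            if obj in self._marks:
                self._marks[obj] += 1      # later uses increment the value
            else:
                self._marks[obj] = 0       # first use creates the log with 0

    def entry_deleted(self, objects: Iterable[str]) -> None:
        """Called when MERGE deletes a log or re-execution entry."""
        for obj in set(objects):
            if obj not in self._marks:
                continue
            if self._marks[obj] == 0:
                del self._marks[obj]       # last user reconciled: drop the log
            else:
                self._marks[obj] -= 1

    def must_delay(self, objects: Iterable[str]) -> bool:
        """True if any object still has a DO log in this partition."""
        return any(obj in self._marks for obj in objects)
```

A new transaction then needs one lookup per object it uses in each
partition, instead of a comparison against every entry still in the
partition-logs.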


2. Partial Partition Merges 


In the presentation of our approach we assumed that
only two partitions existed. However, this may not be the
case: although it should be quite infrequent, there may be
more than two partitions, and they can join in different
orders depending on which communication lines are
reestablished first. This subsection relaxes the assumption
that only two partitions exist and discusses how to deal with
the problem of partial partition merges.

As we mentioned before, a straightforward but
simple-minded solution could be to wait until the network is
completely reconnected and then start the partition merge
algorithm involving all partitions at the same time. However,
this solution has serious disadvantages, such as having more
restricted operation within a partition for more time and
having a significantly increased overhead in order to merge
all the partitions at the same time.

An alternate solution could be to allow partial
partition merges to occur in different orders without any
restriction, as soon as two partitions discover that they
can communicate with each other. This can be done because we
have in the partition-log enough information (names and
values of objects in read and write sets) to avoid repeated
updates to the same object and out of order execution of
transactions in different partitions. However, this will
require an extensive comparison of partition-logs, and
additional lists to store the entries that were removed in a
previous merge, in order to see whether it is required to
remove these entries from a new partition joining the
existing partition.

An example can clarify these concepts. Suppose we
have the partition graph shown in Figure 4.6. Initially the
network partitions forming two groups: the first group
contains only N3 and the second group N1 and N2. Each
partition is assigned a unique partition-ID which is
included in all entries made to their respective
partition-logs. Later, another partition occurs resulting in
N1 and N2 working separately.

Figure 4.6 Partial merges in a partition graph

Again a unique partition-ID is assigned to these
partitions and every entry to the partition-log from now on
will have the new partition-ID. Note that each of the newly
formed partitions N1 and N2 "inherits" the partition-log
entries of the past partitions, maintaining in these entries
their original partition-ID. When N2 discovers that it can
communicate with N3 they start the merge algorithm to
reconcile their databases. However, no entry of their
partition-logs can be deleted, since they will be needed in
later merges to determine whether the transactions
corresponding to those entries have been executed before,
i.e. when N1 and N2 were in the same partition. We must also
preserve entries in the roll-back lists because if entries
corresponding to N1 when it was in the same partition with N2
are rolled back, then when these two sites are reconnected we
must roll back these entries from N2 also. Similar
precautions will have to be taken with entries in the
re-execution list, where some transactions that were
originally executed in (N1, N2) will not be able to be
re-executed in (N2, N3) if some object is present only in N1,
so those transactions should be delayed in their re-execution
until N1 joins (N2, N3). As we can see, the protocols required
to make partial merges in arbitrary orders possible would be
quite involved and the high overhead would make them
impractical. Thus, although it is possible to allow partial
merges in different orders, we will not pursue this solution
because of its complexity.

A far more practical solution could be obtained if
we restrict partial merges in such a way as to allow only
symmetric merges, that is, if we require that the partition
graph be a symmetric directed acyclic graph. Figure 4.7 shows
the way in which merges would be executed if we comply with
this restriction. As we can see, subgraphs are symmetrical,
so partitions merge in the same order in which they were
partitioned.

Having this merge pattern, the only modification we
need to introduce in the MERGE algorithm is that we need to
retain the entries in the new partition-log. Note that
these entries will be stored in topological order, that is,
in the order in which the TOPOLOGICAL-EXEC algorithm
executed the transactions in both partitions. Entries can
be deleted when the sink node is reached.

Figure 4.7 Symmetric partial merges in a partition graph


We now present the modifications required in order
to allow partial merges in a symmetric directed acyclic graph
(DAG). Recall that algorithm MERGE uses algorithm
TOPOLOGICAL-EXEC to delete the entries from the
partition-logs, so the modification will be in this algorithm.

Modification to algorithm TOPOLOGICAL-EXEC

(5) If the sink node in the partition graph has been
reached then delete the entry from the respective
partition-log, else store the entry in the new
partition-log.

We can now proceed to show the correctness of the
modification made to the algorithm.

Proposition 4.4: If the partition graph is a symmetric DAG
then partial partition merges are correctly executed by
the modified MERGE algorithm.

Proof Outline: Since the partition graph is a symmetric DAG
each subgraph represents a situation which is the same
as the one for the original MERGE algorithm. That is,
each subgraph will consist of a subsource node and a
subsink node that are the same and thus no duplicate
updates or out of order execution of transactions may
occur. With the modification in step (5) of algorithm
TOPOLOGICAL-EXEC, partition-log entries are retained
after they have been used to reconcile their respective
databases. Thus, these entries will be available to be
applied in the other subgraphs. There is no problem
with transactions that are undone since they will have
been executed only in the current subgraph and, since no
other subgraph was involved, they do not need to be
undone elsewhere. The same is true for transactions
that violate integrity assertions when being
re-executed. Thus we have no problem in deleting the
corresponding entries from the partition-log. Once the
sink node of the entire graph is reached, we have the
entire network reconnected and the same situation as in
the original MERGE algorithm. When this occurs, the
entries in the partition-logs can be deleted since they
will have been applied to all sites.

As we can see, this solution requires that partial
merges be done in a symmetric way. Obviously this will
delay some partial merges since we have to wait until
partitions in the same subgraph can communicate with each
other. As a matter of fact, there is going to be more to
reconcile afterwards since partitions will continue operating
in partitioned mode. However, this solution is a compromise
between the two solutions that were mentioned first and we
think it is a reasonable one.


One of the assumptions made in section B was that no
irrecoverable external actions were allowed and transactions
that attempted to execute one of these actions were aborted.
In this subsection we analyze the circumstances in which
irrecoverable external actions can be allowed.

Faissol [8] proposed a solution to this problem (see
Chapter 3, section C) by determining those integrity
assertions that could be partitioned in such a way that if
they were not violated in any partition then they would not be
violated as a whole when the network is completely connected.
Thus irrecoverable external actions that involve objects with
this type of integrity assertion could be allowed under
network partition. This solution can also be implemented in
the approach presented in this chapter by giving the DBMS
enough semantic information about integrity assertions dealing
with external actions.

An alternate solution would be to allow
irrecoverable external actions in at most one partition. In
fact, there is no reason why we could not do this. In our
approach integrity assertions for each partition are the same
as the ones when the network is completely connected. The
partition chosen to allow this kind of action could be
determined by one of the methods proposed in Chapter 3,
section B (e.g. voting). Precedence should be given to
partitions where it is more likely that external actions will
occur. For example, if after a partition in a bank system 80%
of the automatic teller machines are in one partition, then
this partition should be allowed to execute irrecoverable
external actions (e.g. cash dispensing). This solution has
the advantages that it does not require extra integrity
assertions and that users who can access the selected
partition will have no restriction at all with respect to
external actions.

However, we have to adopt some special measures to
prevent transactions that execute irrecoverable external
actions from being undone at partition merge. A way to ensure
this is to "mark" those transactions as "permanent" in the
partition-log, so that even if such a transaction is involved
in a cycle and is the transaction with the fewest descendants
it will not be chosen to be rolled back. Another precaution
should be taken so that irrecoverable external actions are not
executed again at partition merge. This is achieved directly
by the proposed approach because transactions are not executed
in the other partition; only the values of the modified
objects are forwarded.

In this thesis we analyzed the problem presented by
network partitions in distributed database systems with
replicated data. Because of the need to preserve mutual
consistency and to avoid irrecoverable external actions
which may be performed by transactions operating on
inconsistent data, it is impossible to allow unrestricted
operation under network partitions. Consistency and
availability just appear to be fundamentally incompatible
goals.

Existing solutions allowing one operating partition
totally block update transactions in all but one partition.
In this way mutual consistency is guaranteed when databases
from different partitions are merged. However, these
solutions are not acceptable for many existing military and
commercial applications which require high availability.
For example, an airline reservation system will prefer to
conditionally reserve a seat on a flight for a customer
rather than telling him to wait until the partition is
repaired.

The two recently proposed solutions which allow multiple
operating partitions greatly increase the availability of
distributed systems, allowing them to utilize more fully the
potential improvement provided by redundant data. The method
consisting of the version vector mechanism together with the
log filter is very simple in structure and involves the
addition of only a few new constructs to the file system
design, namely, file origin points, version vectors, and the
log filter. Thus this approach requires very little system
overhead. However, the approach is especially suited to an
environment in which file updates are moderate and conflicts
occur only rarely, and thus it will probably not be very
useful in the kind of environment characterizing a database
system with high transaction rates and volatility.

The approach involving semantic knowledge is based on
the addition of semantic constraints to those already
existing within a particular application. These constraints
are enforced by the DBMS through the use of strong data
types and integrity assertions. The use of semantic
information about data ensures that conflict detection and
database reconciliation can be performed when the network is
reconnected. This is perhaps the best existing solution to
the network partition problem since it allows the highest
degree of availability through the use of different classes
of semantics of operations. Therefore, in most cases
the database can be reconciled without the necessity of
rolling back transactions which had been executed in
partitioned mode in order to achieve mutual consistency, and
thus the user will feel confident that even in the event of a
network partition his transactions are going to be executed
giving reliable results. However, this approach lacks
generality since the use of data types restricts access to the
database to a limited set of operations. Also, for each
different application, semantic information about the
operations used in it must be given to the DBMS in order
to correctly reconcile the database. Clearly the overhead
incurred by the reconciliation algorithms and the extra
information required will increase proportionally with the
number of applications processed by a system.

The approach proposed in this thesis assumes no semantic
knowledge about the data and thus is more general, since no
predefined operations are required and adding applications
will not increase the amount of information necessary for
the reconciliation algorithm. It can be argued that a serious
disadvantage of this method is that mutual consistency is
achieved by rolling back conflicting transactions and then
re-executing them, and thus final results may be different
from the ones obtained by the users when the network was
partitioned. However, rolling back transactions should not be
the rule but the exception, since in a large class of
applications most transactions will never interfere with each
other [5],[6]. Also, in the uncommon case in which a conflict
is detected the reconciliation algorithm will roll back
transactions only as a last resort, when copies in different
partitions of the same logical data-object have been
independently updated. Even in this case an attempt is made
to minimize the number of transactions that need to be rolled
back in one partition by choosing the transaction with the
fewest descendants.

Further research is needed in this area in order to
determine which is the best method for dealing with network
partitions for different applications. Perhaps a combination
of semantic knowledge with the approach presented in the
thesis will be the most appropriate for some applications.
For example, in many commercial and military applications
class A semantics is the most frequent [8] and, since it is
the class of semantics with the least associated overhead, it
can be used together with the alternate approach presented
here in order to reduce the amount of semantic information
required by the DBMS, thus reducing the overhead.